COVID-19, short for coronavirus disease 2019, is the illness caused by the SARS-CoV-2 virus, first identified in Wuhan, China, in December 2019. Since then, the virus has spread rapidly across the world, leading the World Health Organization to declare a global pandemic. Millions of Americans have been infected, and hundreds of thousands have died from the disease, with those numbers continuing to grow each day. A global race to develop a vaccine in record time ensued, with over 100 candidates being tested across the globe. Although multiple vaccines have received emergency authorization in multiple nations, the situation is worsening daily as new variants are identified, such as those identified in the United Kingdom. In the United States, public health officials are struggling to convince the populace that the vaccines are safe and effective, and widespread anti-vaccine protests are seeking to slow vaccination efforts, which only gives the virus more time to develop a mutation that defeats the current vaccine formulations.
Thus, analyzing data related to COVID-19 is worthwhile: it helps people understand the overall situation and severity of the pandemic and may encourage them to adopt protective measures such as mask-wearing, social distancing, and vaccination. In addition, analyzing this data may expose differences between states in how well their regulations contain the virus, which may help state governments ensure that they use only restrictions that actually work to contain this pathogen.
The COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University is compiled from sources such as, but not limited to, the World Health Organization and the United States Centers for Disease Control and Prevention (a list of all data sources is provided in the README.md file of the repository). It provides case and death counts for each state/U.S. territory for each day since the SARS-CoV-2 virus was first detected in Washington state in January 2020. This data set is known for providing some of the most up-to-date information available, and many organizations cite it as trustworthy and reliable.
| UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | Combined_Key | X1.22.20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 84001001 | US | USA | 840 | 1001 | Autauga | Alabama | US | 32.53953 | -86.64408 | Autauga, Alabama, US | 0 |
| 84001003 | US | USA | 840 | 1003 | Baldwin | Alabama | US | 30.72775 | -87.72207 | Baldwin, Alabama, US | 0 |
| 84001005 | US | USA | 840 | 1005 | Barbour | Alabama | US | 31.86826 | -85.38713 | Barbour, Alabama, US | 0 |
| 84001007 | US | USA | 840 | 1007 | Bibb | Alabama | US | 32.99642 | -87.12511 | Bibb, Alabama, US | 0 |
| 84001009 | US | USA | 840 | 1009 | Blount | Alabama | US | 33.98211 | -86.56791 | Blount, Alabama, US | 0 |
| 84001011 | US | USA | 840 | 1011 | Bullock | Alabama | US | 32.10031 | -85.71266 | Bullock, Alabama, US | 0 |
| UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | Combined_Key | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 84001001 | US | USA | 840 | 1001 | Autauga | Alabama | US | 32.53953 | -86.64408 | Autauga, Alabama, US | 55869 |
| 84001003 | US | USA | 840 | 1003 | Baldwin | Alabama | US | 30.72775 | -87.72207 | Baldwin, Alabama, US | 223234 |
| 84001005 | US | USA | 840 | 1005 | Barbour | Alabama | US | 31.86826 | -85.38713 | Barbour, Alabama, US | 24686 |
| 84001007 | US | USA | 840 | 1007 | Bibb | Alabama | US | 32.99642 | -87.12511 | Bibb, Alabama, US | 22394 |
| 84001009 | US | USA | 840 | 1009 | Blount | Alabama | US | 33.98211 | -86.56791 | Blount, Alabama, US | 57826 |
| 84001011 | US | USA | 840 | 1011 | Bullock | Alabama | US | 32.10031 | -85.71266 | Bullock, Alabama, US | 10101 |
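The day columns in the first table above (for example `X1.22.20`) encode a date in the column name, so most analyses need to turn them into real dates and reshape the wide rows into one record per county per day. The following is a minimal Python sketch of that step, not the code used for this report. The helper names `parse_day_column` and `wide_to_long` are hypothetical, and the leading `X` is assumed to be the prefix that R adds when a column name starts with a digit.

```python
from datetime import date, datetime


def parse_day_column(name: str) -> date:
    """Convert a column name such as 'X1.22.20' into the date 2020-01-22."""
    if not name.startswith("X"):
        raise ValueError(f"not a day column: {name!r}")
    # %m and %d accept one- or two-digit values; %y is the two-digit year.
    return datetime.strptime(name[1:], "%m.%d.%y").date()


def wide_to_long(row: dict) -> list:
    """Turn one wide county row into (county, state, date, count) tuples."""
    records = []
    for key, value in row.items():
        # Only keys that look like 'X<digit>...' are treated as day columns.
        if key.startswith("X") and key[1:2].isdigit():
            records.append(
                (row["Admin2"], row["Province_State"],
                 parse_day_column(key), int(value))
            )
    return records


# One row from the sample table, reduced to the fields used here.
row = {"Admin2": "Autauga", "Province_State": "Alabama", "X1.22.20": "0"}
print(wide_to_long(row))
```

Keeping the parsing in one small function makes it easy to adjust if the column prefix or date format turns out to differ from the assumption above.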
Admin2: name of county/political subdivision of U.S. state/territory
Province_State: name of U.S. state/territory
Xmm.dd.yy: one feature per day since the SARS-CoV-2 virus was first detected in the United States, representing the case/death count of the county/political subdivision defined by the Admin2 feature; takes the format Xmm.dd.yy, where mm is the one- or two-digit month as a decimal, dd is the one- or two-digit day of the month as a decimal, and yy is the two-digit year without century as a decimal
The Homeland Infrastructure Foundation-Level Data Hospitals (HIFLD Hospitals) data set, published by the United States Department of Homeland Security and compiled from sources at the United States Department of Health & Human Services and the Centers for Disease Control and Prevention, provides a list of all hospitals in the United States and their associated trauma level. It identifies how many hospitals, and of what type, exist in each state.
| NAME | STATE | TYPE | BEDS | TRAUMA |
|---|---|---|---|---|
| CENTRAL VALLEY GENERAL HOSPITAL | CA | GENERAL ACUTE CARE | 49 | NA |
| LOS ROBLES HOSPITAL & MEDICAL CENTER - EAST CAMPUS | CA | GENERAL ACUTE CARE | 62 | NA |
| EAST LOS ANGELES DOCTORS HOSPITAL | CA | GENERAL ACUTE CARE | 127 | NA |
| SOUTHERN CALIFORNIA HOSPITAL AT HOLLYWOOD | CA | GENERAL ACUTE CARE | 100 | NA |
| KINDRED HOSPITAL BALDWIN PARK | CA | GENERAL ACUTE CARE | 95 | NA |
| LAKEWOOD REGIONAL MEDICAL CENTER | CA | GENERAL ACUTE CARE | 172 | NA |
STATE: two-letter U.S.P.S. abbreviation of state name
TYPE: type of hospital; value can be "GENERAL ACUTE CARE", "CRITICAL ACCESS", "PSYCHIATRIC", "LONG TERM CARE", "REHABILITATION", "MILITARY", "SPECIAL", "CHILDREN", "WOMEN", or "CHRONIC DISEASE"
STATUS: current status of hospital; value either "OPEN" or "CLOSED"
LATITUDE: latitude of hospital
LONGITUDE: longitude of hospital
BEDS: number of beds available at hospital; value of -999 represents an unknown count of beds
TRAUMA: non-standard trauma center level identifier (definitions can be found in the HIFLD Trauma Levels Data Set); value of "NOT AVAILABLE" indicates the hospital is not classified as a trauma center
The NYT Mask-Wearing Survey data set contains estimates of mask usage for each county in the US, derived from 250,000 survey responses. Each participant was asked “How often do you wear a mask in public when you expect to be within six feet of another person?” and given the choices of never, rarely, sometimes, frequently, or always. The survey was conducted from July 2 to July 14, 2020, and was assembled by The New York Times and Dynata.
| COUNTYFP | NEVER | RARELY | SOMETIMES | FREQUENTLY | ALWAYS |
|---|---|---|---|---|---|
| 1001 | 0.053 | 0.074 | 0.134 | 0.295 | 0.444 |
| 1003 | 0.083 | 0.059 | 0.098 | 0.323 | 0.436 |
| 1005 | 0.067 | 0.121 | 0.120 | 0.201 | 0.491 |
| 1007 | 0.020 | 0.034 | 0.096 | 0.278 | 0.572 |
| 1009 | 0.053 | 0.114 | 0.180 | 0.194 | 0.459 |
| 1011 | 0.031 | 0.040 | 0.144 | 0.286 | 0.500 |
The COUNTYFP column is the FIPS code for the county, and the remaining columns are estimates of the proportion of people in that county who responded with that option. For each county, those values add up to approximately one.
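The "about one" behaviour can be checked directly on the six sample rows shown above. The sketch below is illustrative only, and the function names `response_sum` and `sums_to_one` are hypothetical. The tolerance assumes each of the five values was rounded independently to three decimal places, so each can be off by at most 0.0005 and the row total by at most 0.0025.

```python
# Sample rows copied from the table above:
# COUNTYFP -> (NEVER, RARELY, SOMETIMES, FREQUENTLY, ALWAYS)
rows = {
    "1001": (0.053, 0.074, 0.134, 0.295, 0.444),
    "1003": (0.083, 0.059, 0.098, 0.323, 0.436),
    "1005": (0.067, 0.121, 0.120, 0.201, 0.491),
    "1007": (0.020, 0.034, 0.096, 0.278, 0.572),
    "1009": (0.053, 0.114, 0.180, 0.194, 0.459),
    "1011": (0.031, 0.040, 0.144, 0.286, 0.500),
}


def response_sum(shares) -> float:
    """Sum the five response shares, rounded to the data's three decimals."""
    return round(sum(shares), 3)


def sums_to_one(shares, tolerance: float = 0.0025) -> bool:
    """True when the shares total one within the worst-case rounding error."""
    return abs(sum(shares) - 1.0) <= tolerance


for fips, shares in rows.items():
    print(fips, response_sum(shares), sums_to_one(shares))
```

A row that fails this check would point to a data problem rather than rounding.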
The COVID-19 Vaccinations in the United States data set contains the number of vaccine doses administered by state. Data on COVID-19 vaccine doses administered in the United States are collected by vaccination providers and reported to the CDC through multiple sources, including jurisdictions, pharmacies, and federal entities, which use various reporting methods, including Immunization Information Systems, the Vaccine Administration Management System, and direct data submission.
| State | Total_Doses_Administered | Doses_Administered_per_100k | X18._Doses_Administered | X18._Doses_Administered_per_100K | Ratio_Doses_Administered |
|---|---|---|---|---|---|
| Alaska | 239927 | 32797 | 238872 | 43308 | 0.32797 |
| Alabama | 815108 | 16624 | 814893 | 21361 | 0.16624 |
| Arkansas | 540192 | 17900 | 540003 | 23300 | 0.17900 |
| American Samoa | 18816 | 33788 | 18600 | 42821 | 0.33788 |
| Arizona | 1525794 | 20962 | 1524293 | 27034 | 0.20962 |
| Bureau of Prisons | 52743 | NA | 52740 | NA | NA |
Total doses administered column is the total number of vaccine doses that have been given to people.
Doses administered per 100k column is the total number of vaccine doses given for every 100,000 people.
18+ Doses Administered column is the total number of vaccine doses that have been given to people aged 18 years and older.
18+ Doses administered per 100k column is the total number of vaccine doses given for every 100,000 people aged 18 years and older.
Ratio doses administered column is the doses administered per 100k divided by 100,000 (for example, 32797 becomes 0.32797).
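The per-100k and ratio columns are simple rescalings of the dose counts, which the short sketch below works through with the Alaska and Alabama rows from the table. The function names are hypothetical, and the population figure is backed out from the published rate rather than taken from the data set, so it is only approximate.

```python
def doses_per_100k(total_doses: int, population: int) -> float:
    """Doses administered for every 100,000 residents."""
    return total_doses / population * 100_000


def ratio_doses(per_100k: float) -> float:
    """Rescale a per-100k rate to doses per person, as in Ratio_Doses_Administered."""
    return per_100k / 100_000


def implied_population(total_doses: int, per_100k: float) -> int:
    """Approximate population implied by a dose count and its rounded rate."""
    return round(total_doses / per_100k * 100_000)


# Alaska row from the sample table: 239927 doses, 32797 per 100k.
alaska_population = implied_population(239927, 32797)
print(alaska_population)
print(round(doses_per_100k(239927, alaska_population)))
print(ratio_doses(32797), ratio_doses(16624))
```

Because the columns count doses rather than people, the ratio is doses per person and not the share of people vaccinated.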
The "Infection rates before and after stay at home orders went into effect" data set contains a list of states and the date on which each state's first stay at home order was put into effect. It also has infection rates for days before and after these orders took effect. Infection rates were calculated using daily COVID-19 case counts collected by the Johns Hopkins Center for Health Security.
| State | Order.date | Infection.rate.and.confidence.interval..before.order. | Infection.rate.and.confidence.interval..after.order. |
|---|---|---|---|
| Alabama | 4/4/20 | 0.099 (0.088, 0.109) | 0.042 (0.039, 0.045) |
| Alaska | 3/28/20 | 0.11 (0.095, 0.126) | 0.03 (0.027, 0.032) |
| Arizona | 3/31/20 | 0.134 (0.124, 0.143) | 0.03 (0.025, 0.036) |
| California | 3/19/20 | 0.084 (0.077, 0.091) | 0.055 (0.05, 0.06) |
| Colorado | 3/26/20 | 0.11 (0.1, 0.121) | 0.04 (0.035, 0.044) |
| Connecticut | 3/23/20 | 0.154 (0.136, 0.172) | 0.065 (0.059, 0.07) |
State column is the name of each U.S. state for which data was available.
Order.date column is the date on which the first stay at home order was put into effect.
Infection.rate.and.confidence.interval..before.order. column is the infection rate, with its confidence interval in parentheses, for the day before the order went into effect.
Infection.rate.and.confidence.interval..after.order. column is the infection rate, with its confidence interval in parentheses, for the day after the order went into effect.
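Each infection-rate cell packs three numbers into one string, such as `0.099 (0.088, 0.109)`, so it has to be split before any numeric comparison. The following Python sketch shows one way to do that under the assumption that every cell follows this exact pattern. `RateEstimate`, `parse_rate`, and `intervals_overlap` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class RateEstimate(NamedTuple):
    rate: float
    lower: float
    upper: float


# Matches text such as "0.099 (0.088, 0.109)".
_PATTERN = re.compile(
    r"^\s*([\d.]+)\s*\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)\s*$"
)


def parse_rate(cell: str) -> RateEstimate:
    """Split 'rate (lower, upper)' into three floats."""
    match = _PATTERN.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return RateEstimate(*(float(group) for group in match.groups()))


def intervals_overlap(a: RateEstimate, b: RateEstimate) -> bool:
    """True when the two confidence intervals share at least one value."""
    return a.lower <= b.upper and b.lower <= a.upper


# Alabama row from the sample table.
before = parse_rate("0.099 (0.088, 0.109)")
after = parse_rate("0.042 (0.039, 0.045)")
print(before, after)
print(intervals_overlap(before, after))  # False for this row
```

Non-overlapping intervals are only a rough signal that the rate changed; they say nothing about what caused the change.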
Between 2020-01-22 and 2021-02-24, 28,336,097 total cases of COVID-19 were detected in the United States and 505,890 total deaths were ruled as being caused by COVID-19.
| Statistic | date | total_cases | total_deaths |
|---|---|---|---|
| Min. | 2020-01-22 | 1 | 0 |
| 1st Qu. | 2020-04-30 | 1107214 | 67774 |
| Median | 2020-08-08 | 5022981 | 163216 |
| Mean | 2020-08-08 | 7786083 | 177519 |
| 3rd Qu. | 2020-11-16 | 11337674 | 249572 |
| Max. | 2021-02-24 | 28336097 | 505890 |
As seen in the distributions of cases and deaths by state, California and Texas both appear as outliers, with higher numbers of both cases and deaths. However, the large populations of these states offer a possible explanation for the higher numbers. Additionally, epidemiologic data suggest that mutated variants of SARS-CoV-2 that are more infectious and transmissible may be to blame for the high number of cases in these states.
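Taking population into account usually means comparing cases per 100,000 residents instead of raw counts. The sketch below illustrates how the ranking of two places can flip once counts are normalized. It uses the Autauga and Baldwin county populations from the population table, but the case counts are made-up placeholders chosen only for illustration, and `cases_per_100k` is a hypothetical helper.

```python
def cases_per_100k(cases: int, population: int) -> float:
    """Cases for every 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * 100_000


# (name, population from the population table, PLACEHOLDER case count)
places = [
    ("Autauga", 55869, 500),
    ("Baldwin", 223234, 1500),
]

by_raw_count = max(places, key=lambda p: p[2])[0]
by_rate = max(places, key=lambda p: cases_per_100k(p[2], p[1]))[0]

print("highest raw count:", by_raw_count)
print("highest rate per 100k:", by_rate)
```

With these placeholder numbers, the larger county leads on raw count while the smaller one leads on rate, which is the same effect that can make populous states look like outliers.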
As seen in the above visualizations of the geographic distributions of hospitals and trauma centers in the United States, health care institutions tend to be located around population centers. The distributions also show that larger states with larger populations have more hospitals and trauma centers, and are more likely to have lower level trauma centers. Additionally, lower level trauma centers have, on average, more beds for patients than facilities with a higher trauma level.
| Statistic | BEDS |
|---|---|
| Min. | 2.0 |
| 1st Qu. | 30.0 |
| Median | 89.0 |
| Mean | 159.4 |
| 3rd Qu. | 223.0 |
| Max. | 1592.0 |
| NA's | 188 |
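The 188 missing values in the BEDS summary above presumably correspond to the -999 "unknown" sentinel described in the column definitions. Left in place, that sentinel would drag the minimum and the mean far below their true values. The sketch below shows one way to handle it in Python. The first six values come from the sample hospital rows, and the trailing -999 is appended only to illustrate the sentinel. `clean_beds` is a hypothetical helper.

```python
from statistics import mean, median
from typing import Optional

UNKNOWN_BEDS = -999  # sentinel for an unknown bed count, per the column definitions


def clean_beds(raw: int) -> Optional[int]:
    """Map the -999 sentinel to None and pass real counts through unchanged."""
    return None if raw == UNKNOWN_BEDS else raw


# Six BEDS values from the sample table plus one illustrative sentinel.
raw_beds = [49, 62, 127, 100, 95, 172, -999]

cleaned = [clean_beds(b) for b in raw_beds]
known = [b for b in cleaned if b is not None]

print("missing:", len(cleaned) - len(known))
print("mean with sentinel left in:", mean(raw_beds))
print("mean of known values:", mean(known))
print("median of known values:", median(known))
```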
As seen in the box plot, there are quite a few outliers in the distribution of beds among trauma center levels. This is likely due to the different populations of different regions, as facilities in more highly populated areas will need more beds for patients than those in rural areas. It is likely that trauma centers are created based not on population, but rather on geographic distance to another facility able to provide the same level of care.
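A box plot typically marks a point as an outlier when it falls more than 1.5 times the interquartile range beyond the quartiles. Assuming that convention was used here, the sketch below applies it to the overall BEDS quartiles from the summary table. The box plot itself is grouped by trauma level, so its per-group fences would differ from these overall ones. `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Overall BEDS quartiles and maximum from the summary table.
Q1, Q3, MAX_BEDS = 30.0, 223.0, 1592.0

lower, upper = tukey_fences(Q1, Q3)
print(lower, upper)       # -259.5 512.5
print(MAX_BEDS > upper)   # True: the largest hospital lies beyond the upper fence
```

Since the lower fence is negative and bed counts cannot be, every outlier under this rule is an unusually large facility.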
Grouped by counties, an average of 51% of the responses are “Always,” and an average of 8% of the responses are “Never.” For a single county, the values for each response are supposed to sum to one. In reality, the values are rounded to three decimal places, so the sum for each county ranges from 0.998 to 1.002.
| Statistic | NEVER | RARELY | SOMETIMES | FREQUENTLY | ALWAYS | sum |
|---|---|---|---|---|---|---|
| Min. | 0.00000 | 0.00000 | 0.0010 | 0.0290 | 0.1150 | 0.998 |
| 1st Qu. | 0.03400 | 0.04000 | 0.0790 | 0.1640 | 0.3932 | 1.000 |
| Median | 0.06800 | 0.07300 | 0.1150 | 0.2040 | 0.4970 | 1.000 |
| Mean | 0.07994 | 0.08292 | 0.1213 | 0.2077 | 0.5081 | 1.000 |
| 3rd Qu. | 0.11300 | 0.11500 | 0.1560 | 0.2470 | 0.6138 | 1.000 |
| Max. | 0.43200 | 0.38400 | 0.4220 | 0.5490 | 0.8890 | 1.002 |
There don’t seem to be any significant outliers. This is probably because there were 250,000 survey responses for a survey with only five options, so any individual county would need a large number of unusual responses to become an outlier. There is also less chance for outliers because this data set was grouped by county, forcing the columns in each row to sum to one. There are no NA values, and the data set seems to cover almost every county.
By February 22, 68,150,728 vaccine doses had been administered in the US. Grouped by state, an average of 21,242 doses per 100,000 people (21.2415 doses per 100 people) had been administered. The number of doses administered per 100,000 ranges from 11,767 to 39,499.
| Statistic | Total_Doses_Administered | Doses_Administered_per_100k | X18._Doses_Administered | X18._Doses_Administered_per_100K |
|---|---|---|---|---|
| Min. | 7073 | 11767 | 7073 | 15081 |
| 1st Qu. | 241471 | 18891 | 240832 | 24127 |
| Median | 614928 | 19881 | 614420 | 25428 |
| Mean | 1097822 | 21231 | 1096961 | 27224 |
| 3rd Qu. | 1396224 | 22824 | 1395704 | 28548 |
| Max. | 7728120 | 39499 | 7724412 | 50641 |
The most significant outlier in the data set is the total number of doses administered in California. One possible reason is that the overall education level in that state is high; another is that California's population is large, so a great number of people there are receiving the vaccine.